This is an Exploratory Data Analysis of the individual contributions made to 2016 Presidential Candidates from the state of Ohio. Ohio is known for being a battleground state, and I expect it will provide some interesting insight. Additionally, it is the state I most closely relate to as my home state.
Within this analysis, we will seek to answer some general questions:
- What is the total amount of contributions?
- Which candidate had the highest contributions?
- How do the contributions differ between the Primary & General Elections?
- What is the geographic distribution of the contributions across the state?
- Is there an identifiable pattern to the contribution dates as related to important dates during the campaign?
The data source can be found here: FEC Contributions to All Candidates
Before we can dive into making plots and answering questions, we first need to get an overall feel for the data.
There are 167,259 rows and 18 columns. The only quantitative variable within the set is contb_receipt_amt; all others are categorical (contbr_zip and file_num are stored as integers, but they act as identifiers).
Of note, the contb_receipt_dt column is not stored as a date data type, there is no party affiliation, and there is no marker for whether a candidate went on to become the nominee. We will add those in shortly.
Important variables:
## 'data.frame': 167259 obs. of 18 variables:
## $ cmte_id : chr "C00580100" "C00580100" "C00580100" "C00580100" ...
## $ cand_id : chr "P80001571" "P80001571" "P80001571" "P80001571" ...
## $ cand_nm : chr "Trump, Donald J." "Trump, Donald J." "Trump, Donald J." "Trump, Donald J." ...
## $ contbr_nm : chr "SELL, GREG" "SELLE, JOAN" "SELLERS, JES" "ROOTRING, BEAU" ...
## $ contbr_city : chr "CLAYTON" "MENTOR" "CLEVELAND HTS" "CINCINNATI" ...
## $ contbr_st : chr "OH" "OH" "OH" "OH" ...
## $ contbr_zip : int 45315 44060 44106 45208 44721 432141210 441071232 432022420 450365038 45249 ...
## $ contbr_employer : chr "INFORMATION REQUESTED" "RETIRED" "INFORMATION REQUESTED" "BGR CONSUMER UNDERSTANDING, LLC" ...
## $ contbr_occupation: chr "INFORMATION REQUESTED" "RETIRED" "INFORMATION REQUESTED" "MARKET RESEARCH" ...
## $ contb_receipt_amt: num 97.1 53.5 69.4 88.4 -80 ...
## $ contb_receipt_dt : chr "23-SEP-16" "01-SEP-16" "17-OCT-16" "15-NOV-16" ...
## $ receipt_desc : chr "" "" "" "" ...
## $ memo_cd : chr "X" "X" "X" "X" ...
## $ memo_text : chr "" "" "" "" ...
## $ form_tp : chr "SA18" "SA18" "SA18" "SA18" ...
## $ file_num : int 1146165 1146165 1146165 1146165 1146165 1091718 1144564 1077404 1091718 1077404 ...
## $ tran_id : chr "SA18.102871" "SA18.165861" "SA18.143767" "SA18.110145" ...
## $ election_tp : chr "G2016" "G2016" "G2016" "G2016" ...
There are 24 distinct candidates within this data set.
## [1] 24
The contb_receipt_amt values include some outliers, which we’ll trim using the federal contribution limits.
Contribution rules to Candidate Committees by individuals:
Source: FEC | Contribution limits for 2015-2016
To do so, we’ll need to group by election type, candidate, and then contributor name before removing the extra contributions.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -10800.0 16.0 29.0 119.3 80.0 29100.0
After grouping, we are still left with some outliers.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -8600.0 130.0 306.6 559.5 695.0 29100.0
After removing the outliers, we get a mean of $122.26 and a median of $30.00. So while the average donation is about $122, the typical donation is $30.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08 19.00 30.00 122.26 80.00 2700.00
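The grouping and trimming step described above can be sketched roughly as follows. This is a minimal illustration on toy data, not the original code: the $2,700 figure is the per-election individual limit for 2015-2016, and the rule of dropping non-positive amounts and single receipts above that limit is an assumption inferred from the summary output (minimum 0.08, maximum 2700).

```r
# Toy stand-in for the FEC extract; column names follow the str() output above.
fec <- data.frame(
  election_tp       = c("P2016", "P2016", "P2016", "G2016", "G2016"),
  cand_nm           = c("A", "A", "A", "A", "B"),
  contbr_nm         = c("DOE, JANE", "DOE, JANE", "ROE, RICHARD",
                        "DOE, JANE", "ROE, RICHARD"),
  contb_receipt_amt = c(2700, 500, -80, 29100, 25),
  stringsAsFactors  = FALSE
)

limit <- 2700  # assumed per-election individual limit for 2015-2016

# Total given by each contributor to each candidate in each election.
totals <- aggregate(
  contb_receipt_amt ~ election_tp + cand_nm + contbr_nm,
  data = fec, FUN = sum
)
names(totals)[names(totals) == "contb_receipt_amt"] <- "contbr_total"

# Contributor/candidate/election groups whose total exceeds the limit.
over_limit <- subset(totals, contbr_total > limit)

# Assumed trimming rule: drop refunds (non-positive amounts) and
# single receipts above the limit.
fec_clean <- subset(fec, contb_receipt_amt > 0 & contb_receipt_amt <= limit)

summary(fec_clean$contb_receipt_amt)
```

Inspecting `over_limit` before filtering is a useful sanity check, since an over-limit total can come from several individually valid receipts.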
Looking at the overall distribution, a simple histogram shows a reasonable amount of spread, though it’s right-skewed, with most donations concentrated at the lower amounts, which follows logically from the data summary.
Chopping up the contribution amounts into smaller buckets will give us a better overall sense of the contributions.
I’m choosing $0-15, $15-30, $30-45, $45-60, $60-100, $100-200, $200-500, $500-1000 and $1000-2700.
As we would expect, the majority of people are making smaller donations. Above $100 the count falls off sharply, with a smaller rebound in the $200-500 range before it tails off again.
## buckets
## (0,15] (15,30] (30,45] (45,60]
## 39298 43724 12412 22221
## (60,100] (100,200] (200,500] (500,1e+03]
## 23434 7240 10920 2744
## (1e+03,2.7e+03]
## 3133
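A bucketing like the one tabulated above can be reproduced with base R's `cut()`. The sketch below uses a handful of made-up amounts; only the break points come from the bucket list in the text.

```r
# Made-up donation amounts, one or two per bucket.
amounts <- c(5, 15, 25, 50, 100, 150, 250, 750, 2700)

# Break points matching the buckets listed above.
breaks <- c(0, 15, 30, 45, 60, 100, 200, 500, 1000, 2700)

# cut() builds right-closed intervals by default, e.g. (0,15], (15,30], ...
buckets <- cut(amounts, breaks = breaks)

table(buckets)
```

Because the intervals are right-closed, a donation of exactly $15 lands in (0,15] rather than (15,30].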
Now, looking into the contribution frequency by donor, we can see that the vast majority of donors only contributed once, as evidenced by this long-tailed plot.
## Warning: Removed 843 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
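The per-donor frequency behind a plot like this can be computed with two nested `table()` calls. The sketch below uses invented contributor names and assumes, as a simplification, that a donor is identified by contbr_nm alone.

```r
# Invented contributor names, one entry per contribution.
contbr_nm <- c("DOE, JANE", "DOE, JANE", "ROE, RICHARD",
               "POE, EDGAR", "POE, EDGAR", "POE, EDGAR", "LOE, ANN")

# Number of contributions made by each donor.
per_donor <- table(contbr_nm)

# How many donors gave once, twice, three times, ...
freq_dist <- table(as.vector(per_donor))

# Share of donors who contributed exactly once.
share_once <- mean(as.vector(per_donor) == 1)

freq_dist
share_once
```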
Next, reviewing the contribution frequency by occupation, we can see that retirees are by far the most common donors, well ahead of any other group.
## # A tibble: 25 x 2
## contbr_occupation n
## <chr> <int>
## 1 RETIRED 43565
## 2 NOT EMPLOYED 10382
## 3 INFORMATION REQUESTED 8435
## 4 ATTORNEY 3316
## 5 HOMEMAKER 3232
## 6 PHYSICIAN 3035
## 7 TEACHER 3017
## 8 PROFESSOR 2529
## 9 ENGINEER 1633
## 10 SALES 1615
## # ... with 15 more rows
Next, I want to look into the distribution based upon ZIP code.
There are 1,108 valid unique Ohio ZIP codes in this data set. As ZIP codes are added over time, it’s difficult to say how many ZIP codes were in Ohio for the 2016 election. Using today’s count of 1,447 Ohio ZIP codes as the denominator, 76.6% of current ZIP codes (1,108 / 1,447) contributed to the 2016 campaigns.
## [1] 1108
Despite that, the distribution is skewed toward a much smaller number of ZIP codes. To identify the county associated with each ZIP code, I’ve pulled the ZIP-to-county mapping for the state of Ohio from ZipCodestoGo.com and loaded it.
In doing so we see that just over half (50.1%) of all contributions came from only 5 counties.
Source: OhioDemographics.com
However, when we look at the total dollar amount instead, just over half (50.3%) is contained within 3 counties.
Since the Republican National Convention was held in Cleveland, which is in Cuyahoga County, it seems plausible that this helped Cuyahoga edge out Franklin County for contribution dollars.
## # A tibble: 20 x 4
## county n prop cumulative
## <fct> <int> <dbl> <dbl>
## 1 Franklin 27831 0.168 0.168
## 2 Cuyahoga 21861 0.132 0.301
## 3 Hamilton 16971 0.103 0.403
## 4 Montgomery 8269 0.0500 0.453
## 5 Summit 7896 0.0478 0.501
## 6 Lucas 5867 0.0355 0.536
## 7 Butler 4663 0.0282 0.565
## 8 Delaware 4349 0.0263 0.591
## 9 Stark 4219 0.0255 0.617
## 10 Lorain 3411 0.0206 0.637
## 11 Warren 3388 0.0205 0.658
## 12 Greene 3036 0.0184 0.676
## 13 Clermont 2899 0.0175 0.694
## 14 Lake 2667 0.0161 0.710
## 15 Trumbull 2553 0.0154 0.725
## 16 Mahoning 2511 0.0152 0.740
## 17 Portage 2193 0.0133 0.754
## 18 Licking 1921 0.0116 0.765
## 19 Medina 1918 0.0116 0.777
## 20 Fairfield 1862 0.0113 0.788
## # A tibble: 20 x 4
## county sum prop cumulative
## <fct> <dbl> <dbl> <dbl>
## 1 Cuyahoga 3727887. 0.185 0.185
## 2 Franklin 3608771. 0.179 0.364
## 3 Hamilton 2813321. 0.139 0.503
## 4 Summit 894400. 0.0443 0.547
## 5 Stark 658080. 0.0326 0.580
## 6 Montgomery 627084. 0.0311 0.611
## 7 Delaware 573743. 0.0284 0.639
## 8 Lucas 542903. 0.0269 0.666
## 9 Butler 483375. 0.0239 0.690
## 10 Lorain 363453. 0.0180 0.708
## 11 Warren 359322. 0.0178 0.726
## 12 Lake 296861. 0.0147 0.741
## 13 Mahoning 277770. 0.0138 0.754
## 14 Clermont 266109. 0.0132 0.768
## 15 Geauga 257171. 0.0127 0.780
## 16 Medina 244647. 0.0121 0.792
## 17 Greene 234322. 0.0116 0.804
## 18 Belmont 221460. 0.0110 0.815
## 19 Trumbull 208703 0.0103 0.825
## 20 Portage 168624. 0.00835 0.834
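The prop and cumulative columns in the two tables above follow from a sort and a running sum. Here is a minimal sketch on toy data; the county labels and counts are invented, and the same pattern applies to dollar totals if the counts are replaced with per-county sums.

```r
# One row per contribution, with invented county labels.
contrib <- data.frame(
  county = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D"),
  stringsAsFactors = FALSE
)

# Count contributions per county and sort from largest to smallest.
by_county <- as.data.frame(table(county = contrib$county), responseName = "n")
by_county <- by_county[order(-by_county$n), ]

# Share of all contributions, and the running total of that share.
by_county$prop <- by_county$n / sum(by_county$n)
by_county$cumulative <- cumsum(by_county$prop)

# Smallest number of counties needed to cover at least half of contributions.
k <- which(by_county$cumulative >= 0.5)[1]

by_county
k
```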
The original data source has issues with invalid city names (ZIP codes or counties entered in place of cities, and misspellings). Since I had also loaded the cities that go along with the ZIP codes, I’ve grouped the counts by the cities from the zip_codes.csv loaded previously instead.
As there are numerous cities per county, the 51% line includes more cities than counties. However, Columbus, Cleveland, Dayton & Akron are all in the top spots.
To summarize the above, a sizeable number of donors make smaller donations, and the majority only donate once. Additionally, despite the Republicans having far more candidates, the Democrats outpaced the Republicans in the number of contributions, while the Republicans raised more in total dollars (see the per-candidate table further down).
Both the majority of individual contributions and the majority of the sum total came from a limited number of counties and cities. The standouts are the more populous counties and cities rather than the more rural ones.
Franklin County accounted for 16.8% of the number of contributions, and 17.9% of the total dollar amount.
I created two new variables: Party and electionType. I’ve also cleaned the ZIP codes within the data set, cast the date variable from char to a date, and joined in a clean dataset with ZIP codes from a reputable source.
The City variable had numerous faulty data points due to apparent data entry errors, and the ZIP code variable originally had rows with the wrong number of digits for a valid ZIP code.
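The ZIP code and date clean-up described above might look something like the following sketch. The helper names `clean_zip` and `parse_fec_date` are hypothetical, the Ohio prefix check (43 through 45) is an assumption about what counts as valid, and the date parser maps month abbreviations by hand so it does not depend on the session locale.

```r
# Hypothetical helper: reduce ZIP+4 values to 5 digits and blank out
# anything that is not a plausible Ohio ZIP (assumed prefixes 43-45).
clean_zip <- function(zip) {
  z <- as.character(zip)
  z5 <- ifelse(nchar(z) == 9, substr(z, 1, 5), z)
  ifelse(grepl("^4[3-5][0-9]{3}$", z5), z5, NA_character_)
}

# Hypothetical helper: parse "23-SEP-16" style strings into Date values.
parse_fec_date <- function(x) {
  parts <- strsplit(x, "-", fixed = TRUE)
  day <- vapply(parts, `[`, character(1), 1)
  mon <- match(toupper(vapply(parts, `[`, character(1), 2)), toupper(month.abb))
  yr  <- vapply(parts, `[`, character(1), 3)
  # Two-digit years are assumed to fall in the 2000s.
  as.Date(sprintf("20%s-%02d-%s", yr, mon, day), format = "%Y-%m-%d")
}

clean_zip(c(45315L, 432141210L, 441071232L, 4521L, 90210L))
parse_fec_date(c("23-SEP-16", "01-SEP-16", "15-NOV-16"))
```

Returning `NA` for invalid ZIP codes, rather than dropping the rows outright, keeps the contribution amounts available for analyses that do not need geography.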
Now we will begin to look at the relationship of multiple variables to one another.
First, I want to see the breakout of candidates by party, which shows that the Republicans fielded a whopping 16 candidates in the 2016 Presidential election.
Knowing that, it’s unsurprising that the Republican party as a whole outraised all the other parties.
If we look at the number of contributions per candidate, though, we see that the two main Democratic candidates received the lion’s share of individual contributions. It is especially surprising that Bernie Sanders had the 2nd highest count, as he did not participate in the General Election.
## # A tibble: 24 x 4
## # Groups: party_granular [5]
## party_granular cand_nm count sum
## <chr> <chr> <int> <dbl>
## 1 Democrat Clinton, Hillary Rodham 70726 6784770.
## 2 Republican Kasich, John R. 4752 4454873.
## 3 Republican Trump, Donald J. 26265 4015624
## 4 Democrat Sanders, Bernard 34514 1452268.
## 5 Republican Cruz, Rafael Edward 'Ted' 16032 1315968.
## 6 Republican Rubio, Marco 2457 786969.
## 7 Republican Carson, Benjamin S. 7943 725000.
## 8 Republican Bush, Jeb 274 217655
## 9 Republican Paul, Rand 789 117620.
## 10 Republican Fiorina, Carly 625 83575.
## # ... with 14 more rows
Bernie Sanders’s position begins to make more sense as we look at the contributions by party and election type.
In order to see this in a more normalized fashion, here are the Democratic and Republican parties on a log10 scale.
## subset(fec, fec$party != "Other")$party: Democrat
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.08 11.00 25.00 78.46 50.00 2700.00
## --------------------------------------------------------
## subset(fec, fec$party != "Other")$party: Republican
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8 25.0 50.0 198.5 100.0 2700.0
The Republican party had the higher mean and median: its median donation ($50) was double that of the Democratic party ($25), and its mean ($198.50) was more than 2.5 times the Democratic mean ($78.46).
Using a boxplot to see the distribution across all the candidates, we can see John Kasich had the largest interquartile range and George Pataki had the highest median, while our front-running candidates, Hillary Clinton and Donald Trump, have the smallest interquartile ranges and the most outliers. These outliers represent contributors with a higher donation than the majority.
This makes sense given the much larger number of contributions overall for the front-running candidates.
Moving on to the distribution across the elections themselves, we can see that the majority of all contributions were made during the Primary and not during the General Election.
Taking a closer look at each county and the donations per 1,000 people, we find that Republican donations from Belmont County were the highest per capita of any county and party, at a rate of $3.11 per person, or $3,111.37 per 1,000 people.
In fact, the top 4 counties for Republican contributions per capita out-contributed the top Democratic county.
## # A tibble: 25 x 7
## # Groups: county [25]
## county party total Rank Population per_cap per_cap_th
## <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Belmont Republican 210033. 36 67505 3.11 3111.
## 2 Geauga Republican 184970. 29 94031 1.97 1967.
## 3 Delaware Republican 393076. 14 204826 1.92 1919.
## 4 Hamilton Republican 1458458. 3 816684 1.79 1786.
## 5 Monroe Republican 20788 87 13790 1.51 1507.
## 6 Auglaize Republican 67971. 51 45804 1.48 1484.
## 7 Stark Republican 548484. 8 371574 1.48 1476.
## 8 Washington Republican 88622. 41 60155 1.47 1473.
## 9 Paulding Republican 27377. 83 18760 1.46 1459.
## 10 Cuyahoga Republican 1806785. 2 1243857 1.45 1453.
## # ... with 15 more rows
## # A tibble: 25 x 7
## # Groups: county [25]
## county party total Rank Population per_cap per_cap_th
## <chr> <chr> <dbl> <dbl> <int> <dbl> <dbl>
## 1 Hamilton Democrat 1338035. 3 816684 1.64 1638.
## 2 Cuyahoga Democrat 1895638. 2 1243857 1.52 1524.
## 3 Athens Democrat 92529. 37 65818 1.41 1406.
## 4 Franklin Democrat 1784230. 1 1310300 1.36 1362.
## 5 Delaware Democrat 178217. 14 204826 0.870 870.
## 6 Geauga Democrat 67801. 29 94031 0.721 721.
## 7 Summit Democrat 368259. 4 541918 0.680 680.
## 8 Mahoning Democrat 152147. 12 229642 0.663 663.
## 9 Greene Democrat 99029. 18 167995 0.589 589.
## 10 Lucas Democrat 246308. 6 429899 0.573 573.
## # ... with 15 more rows
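The per_cap and per_cap_th columns are simple ratios of total dollars to population. As a small worked check, the sketch below recomputes them for three Republican rows copied from the first table above (the totals are the rounded values as printed, so the last decimals may differ slightly from the original).

```r
# Three rows copied from the Republican table above (totals as printed).
county_totals <- data.frame(
  county     = c("Hamilton", "Belmont", "Cuyahoga"),
  party      = "Republican",
  total      = c(1458458, 210033, 1806785),
  Population = c(816684L, 67505L, 1243857L),
  stringsAsFactors = FALSE
)

# Dollars per resident, and dollars per 1,000 residents.
county_totals$per_cap    <- county_totals$total / county_totals$Population
county_totals$per_cap_th <- county_totals$per_cap * 1000

# Sort from highest to lowest per-capita contribution.
county_totals <- county_totals[order(-county_totals$per_cap), ]

county_totals
```

Belmont comes out at roughly $3.11 per person despite having by far the smallest total of the three, which is the point the tables make about population and contribution size.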
The Rank column denotes the population rank of the particular county. Sorting the data frame by per-capita contribution, we see that, contrary to what we might expect, the most populous county does not donate the largest amount per person.
This boxplot shows that for the selected occupations, donations to Republican candidates tend to be higher. Additionally, of the selected occupations only “Attorney” and “Physician” sent more donations to non-Republican candidates.
In summary, there were more contributions during the Primary Election than the General, and Republican donors tend to give larger donations on average despite there being more Democratic donors overall.
Additionally, the amount contributed is not directly correlated with the population of the county.
The first multivariate plot I want to look into is how this data is spread geographically across the state. By mapping first the number of contributions against the lat/long from the zipcode dataset things look a little purple. Maybe more magenta than purple but not 100% clear.
If we change the geom size to be based upon contribution total per county, it becomes clear that the Republican party received larger amounts in total than the Democratic party candidates.
The blue dots become isolated patches within a sea of red.
Now I want to look into the timelines of the contributions.
I’ve truncated the candidates list down to the candidates who were a part of the General Election ballot in order to look for patterns in the contribution dates.
Some of the notable dates during this election include:
Those dates are marked in the following plot with a purple dashed line.
While some peaks and valleys seem to align, I want to zoom in on the General election itself to better see what is aligning.
The 3 dashed purple lines represent the dates of the 3 Presidential Debates and the election date. The first 2 lines represent the nomination dates of the Democratic & Republican candidates respectively.
Both the number of contributions and the total contributions certainly appear to coincide with the presidential debates. Interestingly, the count of contributions differs starkly between the Democratic and Republican nominees.
There are a few peaks I’m curious about here. I’ve looked into the news cycles with the corresponding peaks and valleys and listed them below. I’ve selected news events that occurred within the 7 days preceding the peak or valley.
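A week-over-week view like this one can be built by binning contribution dates into weeks and then counting and summing within each bin. The sketch below uses invented dates and amounts, and it assumes weeks start on Sunday.

```r
# Invented contribution dates and amounts.
contrib <- data.frame(
  date = as.Date(c("2016-09-25", "2016-09-26", "2016-10-01",
                   "2016-10-02", "2016-10-09")),
  amt  = c(50, 25, 100, 10, 30)
)

# Bin each date into its week; assumed convention: weeks start on Sunday.
contrib$week <- cut(contrib$date, breaks = "week", start.on.monday = FALSE)

# Number of contributions and total dollars per week.
weekly_n     <- table(contrib$week)
weekly_total <- tapply(contrib$amt, contrib$week, sum)

weekly_n
weekly_total
```

Plotting `weekly_n` and `weekly_total` separately per candidate is what makes a gap between contribution count and contribution total visible.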
Donald Trump
Hillary Clinton
While no causations or direct lines can truly be drawn between these events and the contributions, I do find it interesting to see how they align with what I would have expected to be incendiary issues.
Both candidates had a similar dip the week of September 25th following the first presidential debate. As such, I did not consider it divergent enough to investigate beyond the debate.
Through further exploration of the dataset, we can see that despite the larger number of contributions to Democrats, the Republican candidates raised considerably more in total dollars.
Additionally, the timeline of contributions not only shows the week over week with some news points, but it also shows the absolute bottoming out of individual contributions to Donald Trump after the 2nd presidential debate.
The first plot I’ve selected is the total contributions by party overlaid on the map of Ohio. It really shows the disparity not only in the contribution amounts, but also in the geographical dispersion.
The largest groupings are the significant metropolitan areas of the state. The upper-right amalgamation is Cleveland, which is where the Republican National Convention was held in 2016.
The second plot I’ve selected shows the # of contributions by party, per election phase. I was not expecting the difference between the Primary and the General elections to be that disparate. It leads me to believe there’s further data that could be used to inform this.
Why is there such a difference? Is there a larger advertisement budget up front, or is there a trend after the Primary where it no longer makes a significant difference? Or was this more indicative of traditionally Republican donors “giving up” after Donald Trump was nominated?
The third plot I’ve selected, which I found incredibly interesting, is the contrast between the most populous counties and the average contribution by population. Franklin County is home to the state capital, Columbus; Cuyahoga is home to Cleveland; and Hamilton is home to Cincinnati.
After the top 3, it wasn’t even close. This is likely also informed by a lower population and a higher contribution on average.
The data by itself does not provide a lot of information; most of the interesting findings come from comparing it against something else. For instance, there is no gender value or education level, and the Occupation data is incredibly messy. Even the variable that I would have expected to be clean (ZIP code) contains invalid ZIP codes, phone numbers, and even a county, and that’s before getting into 5 digits vs 5+4.
I think a lot of insight could be gleaned if the occupation value was normalized. There are separate values for CEO, C.E.O, and Chief Executive Officer, to name a few. Due to the sheer number of occupations, I was not able to get into cleaning them. This disparity is what caused me not to investigate the occupation further. It’s very possible that normalizing these names would lead to more detailed information.
Another thing that could be done is to use the US Census ZIP Code Tabulation Area (ZCTA) data to align demographic statistics at the ZCTA level, which would allow for education level, income and industry insight.
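As a starting point for that clean-up, a first normalization pass could strip case, punctuation and extra whitespace and then map known long forms onto one canonical label. This is only a sketch: `normalize_occupation` and its one-entry alias table are hypothetical, and a real pass would need a much longer alias list.

```r
# Hypothetical helper: collapse spelling variants of an occupation.
normalize_occupation <- function(x) {
  key <- toupper(trimws(x))
  key <- gsub("[[:punct:]]", "", key)    # "C.E.O" -> "CEO"
  key <- gsub("[[:space:]]+", " ", key)  # collapse repeated whitespace

  # Hypothetical alias table: long form -> canonical label.
  aliases <- c("CHIEF EXECUTIVE OFFICER" = "CEO")
  hit <- key %in% names(aliases)
  key[hit] <- unname(aliases[key[hit]])
  key
}

normalize_occupation(c("CEO", "C.E.O", " Chief Executive Officer ", "ATTORNEY"))
```

Counting occupations after this pass, and reviewing the most frequent remaining values by hand, would show which aliases are worth adding next.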